smartsnp, an R package for fast multivariate analyses of big genomic data
Abstract
Determining the genetic make-up of populations ('population structure') is a major area of research in multiple disciplines of science (Habel et al., 2015; Helyar et al., 2011). Principal component analysis (PCA: Hotelling, 1933; Pearson, 1901) is a foundational analytical tool in evolutionary research (Cavalli-Sforza & Piazza, 1975) and remains one of the most popular statistical methods for summarizing population structure in the genomic era, essentially because the underlying mathematical theory is conceptually simple (Fenderson et al., 2020) and PCA outputs have a clear interpretation (François & Gain, 2021; McVean, 2009; Peter, 2021). However, the magnitude of the data generated by modern high-throughput sequencing technologies poses substantial computing challenges for PCA and other genomics applications (Schork, 2018; Tripathi et al., 2016) that mandate the development of fast and robust programming pipelines in open-source platforms (e.g. Abraham & Inouye, 2014; Luu et al., 2017). PCA is a fundamental step in the EIGENSOFT software suite, the current field standard for population-structure research, which comprises two modules: (a) EIGENSTRAT (Price et al., 2006), which accounts for ancestral relatedness in genome-wide disease studies contrasting affected individuals and controls, and (b) POPGEN (Patterson et al., 2006), which runs a PCA algorithm (SMARTPCA) that accounts for the expected allele-frequency dispersion caused by genetic drift in biallelic single nucleotide polymorphisms (SNP). The wide utility of this software is illustrated by >8,000 combined citations (Scopus; accessed in November) of the seminal papers describing its functionality (Patterson et al., 2006; Price et al., 2006), and citing publications include research areas like animal domestication (e.g. Orlando et al., 2013; Qiu et al., 2015), extinction risk (e.g. Frandsen et al., 2020; Liu et al., 2018), human (pre)history (e.g. Lazaridis et al.; Tishkoff et al., 2009) and disease (Khera et al., 2016; Zhang et al., 2009). SMARTPCA is currently only available for use in Unix command-line environments and is therefore limited to scientists who are familiar with that bioinformatic language.
Here we present the R package smartsnp for fast and user-friendly computation of PCA on large SNP datasets (Herrando-Pérez, Huber and colleagues). Crucially, smartsnp incorporates the two commonly used functionalities of SMARTPCA: appropriate scaling of genotypes to control for genetic drift, and projection of ancient samples onto a PCA space computed from modern samples. Additionally, smartsnp includes functionality that allows users to contrast the PCA ordination against permutational multivariate ANOVA tests of population structure, which are unavailable in EIGENSOFT. The universality of the R language in scientific research (Tippmann, 2014), together with the speed and simplicity of smartsnp, should be attractive properties for the growing community of scientists investigating population structure in humans and other taxa. The package is compatible with R versions 3.6.3 (29/02/2020) and upwards on Linux, Mac and Windows systems, and comprises four functions: smart_pca, smart_permanova, smart_permdisp and smart_mva (see summary of arguments in Table 1). In the following subsections, we explain and benchmark those functions, and provide descriptions of the implemented input-data formats and SNP-scaling options.
Functions smart_pca, smart_permanova and smart_permdisp implement PCA, permutational multivariate analysis of variance (PERMANOVA, Anderson, 2001) and permutational multivariate analysis of dispersion (PERMDISP, Anderson, 2006), respectively. Function smart_mva is a wrapper for any combination of the three standalone functions. The rationale of these analyses is expanded in Supporting Information S2. Briefly, PCA recalculates the geometric position of the data (variables × samples) by rigidly rotating a system of j orthogonal axes (variables) such that the variance of the i points (samples) is maximized along the rotated axes. For genotype data, SNPs are the variables and genotyped individuals are the samples. PERMANOVA and PERMDISP are tests for differences in the relative position (location) and spread (dispersion) of sample groups (populations), using permutations of a triangular matrix (sample × sample) containing pair-wise inter-sample proximities. Measuring proximities as Euclidean distances allows global testing of location and dispersion within the PCA space, either via the full j-multidimensional space or, alternatively, via the lower-dimensional space of the first principal axes (those typically subjected to visual inspection and inference). Importantly, the application of PERMANOVA and PERMDISP requires groups defined a priori (before undertaking the analyses) from the associated metadata or from theory; testing a posteriori groupings derived from the PCA itself is philosophically and statistically flawed.
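The permutation logic behind a PERMANOVA-style test can be illustrated with a minimal sketch in plain Python. This is not the code of smart_permanova: the function names (sq_euclidean, pseudo_f, permanova) are hypothetical, and the pseudo-F statistic is written from the standard distance-based formulation attributed to Anderson (2001), assuming squared Euclidean distances between samples.

```python
import random

def sq_euclidean(u, v):
    """Squared Euclidean distance between two samples."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def pseudo_f(samples, groups):
    """Distance-based pseudo-F: among-group vs. within-group variation."""
    n = len(samples)
    labels = sorted(set(groups))
    a = len(labels)
    # Total sum of squares from all pair-wise squared distances
    ss_total = sum(sq_euclidean(samples[i], samples[j])
                   for i in range(n) for j in range(i + 1, n)) / n
    # Within-group sum of squares, one term per group
    ss_within = 0.0
    for label in labels:
        idx = [i for i, g in enumerate(groups) if g == label]
        ss_within += sum(sq_euclidean(samples[i], samples[j])
                         for k, i in enumerate(idx)
                         for j in idx[k + 1:]) / len(idx)
    ss_among = ss_total - ss_within
    return (ss_among / (a - 1)) / (ss_within / (n - a))

def permanova(samples, groups, permutations=999, seed=1):
    """Return the observed pseudo-F and its permutation probability."""
    rng = random.Random(seed)
    observed = pseudo_f(samples, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)  # reassign group labels at random
        if pseudo_f(samples, shuffled) >= observed:
            hits += 1
    # The observed arrangement counts as one of the permutations
    return observed, (hits + 1) / (permutations + 1)
```

Because the group labels must exist before the test is run, the sketch takes them as an input rather than deriving them from the data, in line with the a priori requirement stated above.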
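Function smart_pca runs seven steps (Figure 1): (1) loading the SNP data, (2) indexing the samples (group assignment, modern versus ancient) and the samples and SNPs that will be removed from downstream analysis, (3) removing invariant SNPs, (4) imputing missing values (coded as either NA or 9), (5) scaling the genotypes (unscaled, centred, scaled to z-scores, or scaled to control for genetic drift), (6) computing the PCA by singular value decomposition (SVD: canonical or truncated) and (7) optionally projecting ancient samples onto the PCA space of modern samples. In addition to steps (1)–(5), smart_permanova and smart_permdisp (8) partition the variance in an ANOVA framework and (9) estimate the probability (α) of the observed differences among groups given the null hypothesis of no differences between groups. Function smart_mva can compute PCA, PERMANOVA and/or PERMDISP in a single run. All functions conclude their computations by extracting the pertinent results and storing them as named elements of a list. This list can be assigned to an object in the R environment, and each element is accessed by its name. Examples of how to run the functions are explained in the documentation of our package, using simulated data (README file) and real data (vignette) examining the flyways of a cosmopolitan bird (Kraus et al., 2013).
Genotypes (g) are taken as [0|1|2] for diploid organisms, based on the number of copies of the non-reference allele. For instance, for a SNP with reference allele G and variant allele T, g(GG) = 0 (homozygous reference), g(GT) = 1 (heterozygous) and g(TT) = 2 (homozygous non-reference). Genotypes of haploid and polyploid organisms can be coded similarly for use with the package. smartsnp accepts the following input formats: a generic text file without row (SNP) or column (sample) names, and *.geno files in uncompressed (EIGENSTRAT) or compressed/binary (PACKEDANCESTRYMAP) format (https://reich.hms.harvard.edu/software). For data stored in VCF or PLINK format (Chang Zhang, 2016), step-by-step instructions for converting them into flat files handled by smartsnp are provided in a vignette in the GitHub repository (https://christianhuber.github.io/smartsnp/articles). Handling of missing values is achieved either by removal of SNPs with ≥1 missing value, or by imputation with SNP means (Marchini & Howie, 2010).
Users are required to provide a vector assigning samples to groups: for EIGENSOFT files, this vector is often obtained from the 3rd column of the *.ind file, which also contains alpha-numeric sample identifiers (1st column) and user-predefined descriptors of sample sexes (2nd column). EIGENSOFT was conceived for human genetics, so the suite analyses 22 (autosomal) chromosomes by default. If a dataset contains >22 chromosomes and the parameter numchrom (number of chromosomes) is left unmodified, SMARTPCA subsets the data to chromosomes 1–22. Our package analyses the SNPs supplied by the user (autosomes with/without sex chromosomes), and discrete sets of SNPs (by number) can be excluded from PCA, PERMANOVA and PERMDISP when specified by the user (Table 1).
We expedited runtime at two key computational bottlenecks: data loading and SVD computation.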
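The genotype coding and the per-SNP preparation steps (removal of invariant SNPs, mean imputation and scaling) can be sketched as follows. This is an illustration in plain Python, not the package's R code: the function names and the scaling labels ("centre", "zscore", "drift") are hypothetical, and the drift scaling assumes the normalization of Patterson et al. (2006), in which each centred genotype is divided by sqrt(p(1 - p)), with p the SNP mean divided by 2.

```python
import math

MISSING = 9  # missing-genotype code mentioned in the text (alongside NA)

def code_genotype(genotype, ref):
    """Count copies of the non-reference allele, e.g. 'GT' with ref 'G' -> 1."""
    return sum(1 for allele in genotype if allele != ref)

def prepare_snp(row, scaling="drift"):
    """Impute and scale the genotypes of one SNP across samples.

    Returns None for an invariant SNP (it would be removed from analysis).
    """
    observed = [g for g in row if g != MISSING]
    if len(set(observed)) < 2:  # invariant (or entirely missing) SNP
        return None
    mean = sum(observed) / len(observed)
    # Mean imputation followed by centring: missing entries become 0
    centred = [0.0 if g == MISSING else g - mean for g in row]
    if scaling == "centre":
        return centred
    if scaling == "zscore":
        sd = math.sqrt(sum(c * c for c in centred) / (len(centred) - 1))
    elif scaling == "drift":
        p = mean / 2  # estimated allele frequency for diploid genotypes
        sd = math.sqrt(p * (1 - p))
    else:
        raise ValueError("unknown scaling option: " + scaling)
    return [c / sd for c in centred]
```

For example, prepare_snp([0, 1, 2, 9]) has a SNP mean of 1 and p = 0.5, so the centred genotypes are divided by 0.5 and the imputed entry stays at 0.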
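Data loading relies on vroom::vroom_fwf (Hester & Wickham) for fast conversion of fixed-width files (EIGENSTRAT), and on an internal C++ function customized to emulate admixtools::read_packedancestrymap (Maier & Patterson) for binary files (PACKEDANCESTRYMAP). Generic text files are read with data.table::fread (Dowle & Srinivasan, 2019), with automatic detection of file extension and separators. To reduce the load on memory, SNPs with zero variance (same genotype across all samples) are removed by default, as they make no contribution to variance partitioning. We further expedited computation by applying truncated SVD (calculation of a predefined number of axes) with RSpectra::svds (Qiu & Mei) rather than computing all axes with bootSVD::fastSVD (Fisher, 2015). Computation with truncated SVD is much faster for big datasets (see benchmarking below), but the best option will depend on the dimensions of the dataset under investigation.
We benchmarked smartsnp with microbenchmark::microbenchmark (Mersmann, 2019) on 34 simulated datasets (described in Supporting Information S3) in two ways. First, we compared the computation times of the functions across datasets of different sizes. In Tables S2 and S3, we report the mean runtimes and standard errors (10 runs) of smartsnp's functions. Runtime increased from smart_pca through to smart_permdisp, indicating that the most resource-consuming calculations were those of α-value estimation rather than PCA. Notable speed gains occurred with the wrapper function smart_mva: for a given dataset, it was faster, by 1–3 orders of magnitude in the reported benchmarks, than running the standalone functions separately, because the former needs to load and prepare the data only once. On average, runtimes ranged from ≤30 s for the smaller analyses to <5 hr for the most demanding ones, depending on the function and on the numbers of SNPs and samples (Tables S2 and S3). Truncated SVD was up to 3 to 19 times faster with increasing dataset size (Table S2). Second, in Figure S1 and Table S4, we compared smartsnp::smart_pca against EIGENSOFT's SMARTPCA. On a single core, smart_pca was >4× faster than SMARTPCA in one configuration and ~2× faster for the largest datasets in another; with multithreading (4 cores), differences were about 2× or the two programs had similar speeds, depending on dataset size and the amount of missing data (Table S4). These speed improvements come at the cost of memory efficiency, with smartsnp using more random-access memory (RAM) than SMARTPCA for equivalently sized datasets (though the usage levels are not onerous for modern RAM specifications; see below).
Palaeogenomics is a rapidly growing field (Brunson & Reich) that uses ancient DNA (aDNA) recovered from specimens over at least the last 500,000 years (Pääbo et al., 2004; Slatkin & Racimo) to investigate (pre)historical questions. The degradation of aDNA results in abundant missing bases, challenging subsequent analyses. Among the approaches to handle missing data in PCA (reviewed by Ausmees, 2019; Günther & Jakobsson), smartsnp implements the 'Projection onto the Model Plane' after Nelson et al. (1996), the standard method in the field, as performed by SMARTPCA (Patterson et al., 2006). PCA is computed on modern samples only, and ancient samples are then projected onto the PCA space by linear regression.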
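The coordinates of an ancient sample on a particular subset of PCA axes equal the coefficient (slope) of a linear fit through the origin (Nelson et al., 1996), where the response is the vector of non-missing genotypes of the ancient sample, and the predictors are the vector(s) of SNP coefficients (loadings) of the PCA axes. For example, projecting onto two PCA axes equates to a model with two predictors (or vectors). smartsnp provides a choice of the axes used for projection (e.g. axes 1 and 2, or other combinations of axes).
We analysed two previously published datasets: Lazaridis et al. (2016) examined anatomically modern humans Homo sapiens, and Pilot et al. (2019) examined grey wolves Canis lupus. We quantified the match between the PCA from SMARTPCA (in EIGENSOFT) and smart_pca (in smartsnp) with two metrics (see Legendre & Legendre, 2012 for details of the tests): the Spearman correlation (Spearman, 1904) between the ranked positions of samples along each PCA axis in the two analyses, and the Mantel test (Mantel, 1967) of the correlation between inter-sample distance matrices (sample × sample matrix), with the α of correlations estimated from 999 permutations. We replicated Lazaridis et al.'s PCA (more details in the vignette: https://christianhuber.github.io/smartsnp/articles). For the wolf dataset, we also formally tested whether the reported population structure was supported by PERMANOVA and PERMDISP. As during the Mantel test, permutations quantify how often the permuted data result in a test statistic larger than that observed in the empirical data, with lower probabilities indicating a structure less likely to be due to chance.
Lazaridis et al. (2016) investigated the origins of farming using >500 thousand SNPs genotyped in 1,152 individuals sampled in West Eurasia, of which 278 were ancient samples, hunter-gatherers among them (spanning 12,000–1,400 BC). Our PCA (Figure 2) mirrored the original (figure 1b in Lazaridis et al., 2016), capturing the gradient from European (left) to Near East (right) populations along PCA axis 1, and further gradients along axis 2. The Spearman correlation between the two packages was 0.999, while the Mantel correlation was 0.969 (α = 0.001). We found the same agreement with SMARTPCA in a second comparison, with correlations of 0.999 (for PCA axes 1 and 2) and 0.974 (α = 0.001). Such high correlations support a near-perfect match between the two packages. smart_pca used more memory but was faster than SMARTPCA (RAM allocation 4,079 vs. 580 MB, and runtime 2.0 vs. 6.2 min, respectively).
Pilot et al. (2019) investigated phylogeographical patterns using 42,320 SNPs genotyped in 306 grey wolves (8 populations) from Eurasia and North America. Among other predictions, they hypothesized that linkage disequilibrium (non-random association of alleles at different loci) in Eurasian populations would vary proportionately with their distance from American populations. Our PCA (Figure 3) again mirrored the original (figure 3d in Pilot et al., 2019). It recapitulated the position of the Eurasian populations (right), with the Asian Pleistocene (~35,000 BP) Taimyr wolf lying between clusters, and with Mexican wolves separating from the others (bottom) (Figure 3). The correlations between the two packages were again close to 0.999; for this relatively small dataset, the runtimes of smart_pca (2 s) and SMARTPCA were comparable, while smart_pca allocated about 15 times more RAM (442 vs. 29 MB). Based on the original PCA and related analyses, Pilot et al. (2019) surmised that the Taimyr wolf represents a sister lineage, but that the relationship is uncertain. After excluding the Taimyr wolf, genetic diversity differed among populations in both location and dispersion (α = 0.0001 in both tests; Table 2).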
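The regression-based projection described at the start of this section (a least-squares fit through the origin of a sample's non-missing genotypes on the loading vectors) can be sketched for two PCA axes. This is a plain-Python illustration under stated assumptions, not the package's code: the function name project_onto_two_axes is hypothetical, missing genotypes are represented by None, and the genotypes are assumed to be already scaled in the same way as the modern samples.

```python
def project_onto_two_axes(genotypes, load_a, load_b):
    """Coordinates of one sample on two PCA axes via regression through the origin.

    genotypes: scaled genotypes of the sample, None where missing.
    load_a, load_b: SNP loadings of the two axes (same SNP order).
    """
    # Keep only SNPs genotyped in this sample
    rows = [(y, a, b) for y, a, b in zip(genotypes, load_a, load_b)
            if y is not None]
    # Terms of the 2 x 2 normal equations (no intercept)
    saa = sum(a * a for _, a, _ in rows)
    sab = sum(a * b for _, a, b in rows)
    sbb = sum(b * b for _, _, b in rows)
    say = sum(a * y for y, a, _ in rows)
    sby = sum(b * y for y, _, b in rows)
    det = saa * sbb - sab * sab
    if det == 0:
        raise ValueError("loadings are collinear on the non-missing SNPs")
    # Solve by Cramer's rule; the slopes are the projected coordinates
    coord_a = (say * sbb - sby * sab) / det
    coord_b = (sby * saa - say * sab) / det
    return coord_a, coord_b
```

Solving both slopes jointly matters because loadings that are orthogonal over all SNPs are generally no longer orthogonal once the SNPs missing in a sample are dropped.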
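In pair-wise PERMANOVA comparisons, probabilities reached α = 0.0021 (with correction for multiple testing). In PERMDISP, a total of 16 (no multiple-testing correction) and 9 (multiple-testing correction) of 21 pair-wise comparisons had α < 0.1. Dispersion around the spatial medians was lowest for Minnesota wolves and highest for other wolf populations, with the remaining populations exhibiting intermediate dispersions (Figure S2). Increased heterogeneity in some wolf populations might indicate that samples cover a wider geographical range, or that the wolves selected for the study form part of reintroduced populations that experienced bottlenecks and strong genetic drift, which magnify variability in genetic composition (Małgorzata Pilot, pers. comm., May). Runtimes of the two tests totalled 28 and 23 s.
Differences in the dispersion of neutral markers reflect demographic history (Charlesworth et al., 2003), so PERMDISP can detect processes that decrease (e.g. bottlenecks) or increase (e.g. admixture) genetic variation. More generally, in ecology, dispersion in species composition is interpreted as a measure of beta diversity (Anderson et al.) that quantifies turnover in species assemblages, and serves as a measure of stress (Warwick & Clarke, 1993) that signals the impact of environmental perturbations. Both interpretations have analogous applications in genetics. For instance, variable loci can have their strongest effects on phenotype in certain ethnic groups (Fadhlaoui-Zid, Solovieff, 2010; Yu, 2020), and this property has been used as an indicator of disease profiles (Horne, Ioannidis, Manichaikul, 2012; Turajlic).
smartsnp allows users to conduct exploratory analyses and to confirm hypotheses about population structure in large SNP datasets. It can be applied to living systems from haploid to polyploid organisms to visualize complex relationships and processes, and is useful in phenotype, disease and ancestry studies. It handles data with abundant missing values (aDNA and low-coverage data) as long as high-quality data from modern samples are available. Its PCA outputs mirror those of SMARTPCA while running 2–4 times faster in a user-friendly, platform-independent context. By providing tests of differences among groups, smartsnp makes it possible to relate population structure to potentially evolutionary, ecological and sociocultural factors.
We thank Julia A. Pilowsky, Simon (J.) Tuke and Joshua M. Schmidt for guidance on package editing in GitHub/CRAN, Robert Maier for sharing the code of his function admixtools::read_packedancestrymap, Małgorzata Pilot and Wiesław Bogdanowicz for the kind provision of the wolf data and assistance to interpret the genetic-dispersion results, Robert H. S. Kraus for helping to correct the bird-flyways vignettes, and Fernando Valladares for proof-reading the Spanish Abstract (available in the online version of the article). Published datasets are available at https://reich.hms.harvard.edu/datasets.
Conflict of interest: None declared.
Authors' contributions: C.D.H. and S.H.-P. conceived the idea and carried out the benchmarking, and the draft manuscript and manuals were written among the authors; R.T. optimized package performance. Further contributions included revision, supervision throughout, the code to read PACKEDANCESTRYMAP files, and submission of the package to CRAN, GitHub and Zenodo. All authors contributed to manuscript revisions and approved the submission.
The peer review history for this article is available at https://publons.com/publon/10.1111/2041-210X.13684.
The package is freely available with its user's manual under the MIT licence from the Comprehensive R Archive Network (CRAN: https://cran.r-project.org/web/packages/smartsnp), GitHub (https://github.com/ChristianHuber/smartsnp) and Zenodo (Huber et al., 2021: https://doi.org/10.5281/zenodo.5124765). Vignettes with several package-usage examples are available at https://christianhuber.github.io/smartsnp/articles. The package includes a simulated dataset (named 'dataSNP') with samples in columns and randomly generated SNPs in rows (9,886 according to the description), with missing values coded as 9s. A real dataset of 364 SNPs genotyped in 696 mallards, Anas platyrhynchos, comes from Kraus et al. (2013). Website hyperlinks are listed in Table S1.
Please note: the publisher is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.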
Similar resources
An Architecture for Security and Protection of Big Data
The issue of online privacy and security is a challenging subject, as it concerns the privacy of data that are increasingly more accessible via the internet. In other words, people who intend to access the private information of other users can do so more efficiently over the internet. This study is an attempt to address the privacy issue of distributed big data in the context of cloud computin...
tourr: An R package for exploring multivariate data with projections
This paper describes an R package which produces tours of multivariate data. The package includes functions for creating different types of tours, including grand, guided, and little tours, which project multivariate data (p-D) down to 1, 2, 3, or, more generally, d (≤ p) dimensions. The projected data can be rendered as densities or histograms, scatterplots, anaglyphs, glyphs, scatterplot matr...
The Decoding Toolbox (TDT): a versatile software package for multivariate analyses of functional imaging data
The multivariate analysis of brain signals has recently sparked a great amount of interest, yet accessible and versatile tools to carry out decoding analyses are scarce. Here we introduce The Decoding Toolbox (TDT) which represents a user-friendly, powerful and flexible package for multivariate analysis of functional brain imaging data. TDT is written in Matlab and equipped with an interface to...
Hal: an Automated Pipeline for Phylogenetic Analyses of Genomic Data
The rapid increase in genomic and genome-scale data is resulting in unprecedented levels of discrete sequence data available for phylogenetic analyses. Major analytical impasses exist, however, prior to analyzing these data with existing phylogenetic software. Obstacles include the management of large data sets without standardized naming conventions, identification and filtering of orthologous...
PETRA: Multivariate analyses for neuroimaging data
In last years, many research efforts in neurosciences have focused in multivariate approaches based on machine learning as an alternative to the use of Statistical Parametric Mapping and the univariate analyses that it provides. However, this relatively new field still lacks of a software framework that completely meets the needs of the scientific community. In this work we present a toolbox de...
Journal
Journal title: Methods in Ecology and Evolution
Year: 2021
ISSN: ['2041-210X']
DOI: https://doi.org/10.1111/2041-210x.13684